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Abstract —Power consumption is an concern on modern com¬ 
puting, especially on high performance computing (HPC) due 
to maintain costs and ambient impact. Designing energetically 
efficient systems is essential and since a lot of HPC tools are batch 
applications. We develop a method for executing energetically 
efficient batch applications controlling the cpu frequency and 
number of active cores. This method differs from frequency scale 
algorithms that do not have any previous information such as 
total execution time or the power consumed. We developed a 
characterization for the runtime of the application using artificial 
intelligence and also modeling the power consumption of the 
system. With this we can compute the energy as a function of 
frequency, number of active cores and input size and minimize 
finding the configuration that minimizes the total consumed. 

We were able to run applications consuming less power than 
an energy saving method that exists in the Linux system. 

Index Terms —Energy efficiency, Power modeling, Application 
characterization. 

I. Introduction 

The processor is the main contributor to power draw on 
servers as shown in [1], [7] they contribute between 20-40% 
to the total servers power draw. Also Google servers showed 
that during peak utilization processors consumed about 57% of 
the total servers power consumption. Reducing the processor 
consumption is really effective on reducing the whole system 
consumption and there are many researches in these area [2], 
[6], [9], [10], [13], [17], [19]. 

Modern processors incorporates several features for power 
management like core independent units that can be disabling, 
clock gating technique used in many synchronous circuits 
for reducing dynamic power dissipation, but one of the most 
efficient is Dynamic Voltage and Frequency Scaling (DVFS) 
with allows the control of the voltage and frequency on 
hardware or software controlled by the operating system. 

In this paper we going to implement of a schedule algorithm 
that adopts frequency scaling and variation on number of cores 
active to minimize the total energy consumed. This requires 
that the system have access to real-time power information. In 
our experiment we use a set of sensors provide by the Intelli¬ 
gent Platform Management Interface (IPMI) specification they 
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can monitor variables such as system temperatures, voltages, 
fans and power supplies. 

This algorithm use a characterization of the runtime of the 
applications and a model of the power consumption of the 
system. 

The power model was based on the equations described on 
[15] that model the power consumption of the logic gates in 
function of the operating frequency. We include the number 
of cores and sockets detailed on section IV. 

The characterization describe on the section III has the focus 
on batch applications where the use interact only providing an 
initial input. These are common on HPC for example weather 
forecast, fluid simulation and particle simulations. The idea 
is to predict the time of execution that a application will 
need for a given configuration, the inputs to this model are 
frequency, number of active cores and application input size. 
This is done using supervised learning method, Support vector 
machine (SVM). 

SVM [16], [18] are supervised models associated with 
learning algorithms that analyze data used for classification 
but can be extended to solve regression problems. This method 
is called Support Vector Regression (SVR). 

Furthermore we going to validate the results in section VII 
with the PARSEC applications and compare to existent power 
save algorithm on Linux. 

II. Methodology 

A. The setup 

Our experiments system consists of two processor Intel 
Xeon E5-2620 v3 with 16 cores along with 2 hardware threads 
for each core. The maximum non turbo frequency is 2.3 GHz 
and the total physical memory of 128GB(8 x 16GB). We 
disable Turbo Boost and hyper-threading for all experiments 
and the operating system is a Centos 6.0 with a Linux kernel 
2.6.32. 

On Linux system scaling governors are policies that decides 
what frequency should be used. The default driver acpi-cpufreq 
provides generic governors such as Performance, Powersave, 
Ondemand, Conservative and Userspace. Performance and 
Powersave are static wich set to the maximum and minimum 
allowed frequency respectively. Ondemand and Conservative 



implementing algorithms to estimate the CPU required capac¬ 
ity and adjust the processor frequencies according and the 
Userspace allow the user to specify the frequency. 

To modify the CPU clock frequency we develop a python 
module (https://github.com/VitorRamos/cpufreq) that use the 
Linux file system /sys/devices/system/cpu/cpu*/cpufreq. Also 
we use the acpi-cpufreq driver which provide the Userspace 
governor that allow user to change the frequency inside a 
discrete range. To disable the cores likewise using Linux 
virtual files on /sys/devices/system/cpu/cpu*/online. 

On some systems the frequency scaling on software may 
be disable and the folder /sy s/devices/sy stem/cpu/cpu */cpufreq 
does not exists, in this case it’s necessary to enable frequency 
scaling by software on the Basic Input Output System (BIOS). 

To monitor the program execution and collect information 
related to power, voltages, currents and temperatures of the 
entire system we create a python module that collects this data 
from IPMI web service (https://github.conVVitorRamos/ipmi). 

IPMI [11] defines a set of interfaces used by system ad¬ 
ministrators for out-of-band management of computer systems 
and monitoring platform status through local network. The 
sub-system consists of a main controller, called the baseboard 
management controller (BMC) and other management con¬ 
trollers distributed among different system modules, as shown 
on the figure 1. 
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Fig. 1. IPMI block diagram 


The applications used was from the PARSEC benchmark 
suite composed of parallel programs. The suite focuses on 
emerging workloads and was designed to be representative 
of the next generation shared-memory programs for chip- 
multiprocessors. 

B. The Applications 

The PARSEC benchmarks [3] cover a ample range of 
areas such as financial analysis, computer vision, engineering, 
enterprise storage, animation, similarity search, data mining, 
machine learning, and media processing. Vary in type of 
parallelization model (data-parallel or pipelined), working set 
(ranging from small to huge), and communication intensity 
(varying between low and high). For each benchmark several 
data input sets are available, including Test, Simdev, Sims- 


mall, Simmedium, Simlarge, and Native (with Test being the 
smallest and Native being the largest input set). 

In our experiments we use PARSEC version 3.0 and the ap¬ 
plications used was Blackscholes, Fuidanimated and Ray trace. 

Blackscholes is an Intel RMS benchmark. It calculates the 
prices for a portfolio of European options analytically with the 
Black-Scholes partial differential equation. There is no closed- 
form expression for the Black-Scholes equation and as such 
it must be computed numerically. Their inputs are number of 
threads, input file and output file name. 

Fuidanimated also Intel RMS application uses an extension 
of the Smoothed Particle Hydrodynamics (SPH) method to 
simulate an incompressible fluid for interactive animation 
purposes. Their inputs are the number of threads the number 
of frames and the input file. 

Raytrace intel RMS application as well uses a version of 
the raytracing method that would typically be employed for 
real-time animations such as computer games. It is optimized 
for speed rather than realism. The computational complexity 
of the algorithm depends on the resolution of the output image 
and the scene. 

III. Characterization Process 

To characterize the application we run tests varying number 
of active cores on a rage from 1 <= P <= 32, depending 
on the application, because some applications can only run 
with multiple threads of 2 and others the number of threads 
has to be even. The frequency range was the same for all 
1.2 <= F <= 2.2 with 100 MHz steps and the program 
input, 6 different input sizes for each application. 

The simplified algorithm: 

for f in frequency list do 
change cpu frequency to f 
for p in thread list do 
enable only p cores 
for arg in list of arguments do 
run program with arg 
while program is running do 
collect sensor data 
collect time data 
end while 
end for 
end for 

end for 

The input sizes was chosen such way that faster execution 
(32 cores and maximum frequency) take a few seconds and 
the longest (1 core and minimum frequency) a few hours to 
complete, sampling the power information every second. The 
total time to complete the characterization varied between one 
and three days. With the data collected we can use the SVR 
to estimate the time. 

SVR was original designer to be a linear machine. However, 
in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir 
N. Vapnik suggested a way to create nonlinear classifiers by 
applying a modification to the kernel know as the kernel 
trick. The resulting algorithm is formally similar, except that 
















every dot product is replaced by a nonlinear kernel func¬ 
tion. This allows the algorithm to fit the maximum-margin 
hyperplane in a transformed feature space Some common 
kernels include Polynomial (homogeneous), Polynomial (inho¬ 
mogeneous), Gaussian radial basis function (RBF), Hyperbolic 
tangent. 

Using the SVR implementation from skleam [12] a open 
source python machine learning for python. To find the best 
parameters to it we use a grind search across several parame¬ 
ters and pick the best ones. In this case a RBF kernel and the 
penalty for wrong term of 10 * 10 3 and gamma 0.5. To train 
the SVR we split the data collected from the monitor script 
into 90% to training and 10% for testing the accuracy. 

A. Evaluation of the model 

To evaluate the quality of the model we calculate the mean 
absolute error for the training and test set. 


TABLE I 

Mean absolute error 


Application 

Test 

Training 

Blackscholes 

2.48 

0.98 

Fluidanimate 

1.43 

0.88 
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IV. The Power model and fitting 

There are many factors contributing to the CPU power 
consumption [5], [8], [14] they include dynamic power con¬ 
sumption, short-circuit power consumption, and power loss 
due to transistor leakage currents. But as they have very 
complex circuits an approach to model the CPU power draw 
is taking advantage of his components constitution since they 
are mainly consists of logic gates. Calculating the power 
consumption for one logic gate and multiply by the total 
number of gates provide a good estimative. 

The modern technology to construct logic gates is the 
CMOS and there are three main components of power dis¬ 
sipation in digital CMOS circuits. 

Ptotal Pstatic T Pleak % Pdynamic (1) 

According to [4], [15], the dynamic power and leakage 
power behavior according to the equations (3), (4). 

Another common approximation that we going to use is 
that: 

focV (2) 

F 'dynamic = CV 2 f OC / 3 (3) 

Pleak OC V OCf (4) 

Where C is the CMOS capacitance, V the voltage apply to 
the circuit and f the switching frequency. 

Replacing the equations (3), (4) on (1) the model is: 

Ptotal (/) =c 1 / 3 + c 2 / + c 3 (5) 

Include the number of cores active (p) the estimation 
become: 

Ptotal(f,P ) = P(cif + C 2 f )+ C 3 (6) 

Because our system has more than one socket we also have 
to include the cost to enable the each socket, including the 
variable (s) on the equation, the final model is: 

Ptotal (/, P, S ) = P(ci/ 3 + C 2 f ) + C 3 + C 4 S (7) 

To fit this equation we run a program that use 100% of 
cpu for all combination of frequency starting from 1.2 to 2.2 
GHz increasing 100 MHz and active cores starting from 1 
to 32 increasing 2. And we collect the power information of 
the IPMI sensors for about for 100 second on total, with a 
sampling rate of 1 second. 

We find the coefficients ci, C2, C3 and C4 using multi-linear 
regression. 

The result of the fitting we can see on figure 2 and the 
absolute mean error was 2.45. 


Ptotal (/ 1 Pi S) = 0.27 fp + 1.16 fp + 205.56 + 4.00s (8) 
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Fig. 2. Fitting 

V. The energy calculation and minimal energy 

CONFIGURATION 

Combine the power model of the previous section and the 
SVR characterization of the time we can calculate the total 
energy consumed with the equation (9). 

E(f,p,N)=P(f,p,s)SVR(f,p,N) (9) 

With this equation its possible to find the configuration 
where the energy is minimized for a given input but also 
find minimum energy configuration given a restriction of time, 
frequency or number of active cores. 

In our tests we look for the global minimum that we found 
by calculate the energy value for all combinations of frequency 
and number of active core since both variables vary discrete 
in our system. 

VI. Results 

Calculating the energy equation for the parsec application 
using our model we found that observation only the variation 
on the number of cores depending on how the application 
scales there is a threshold where the decrease in execution 
time due to the parallelization stop compensates the highest 
energy consumption for using more cores and apart from that 
threshold the power consumption does not vary much, as we 
can see on figures bellow. 

Looking only at the frequency variation the energy con¬ 
sumption has greater significance when the program runs 
with less active cores but even the difference being not very 
significant the total energy of the best configuration always 
tended to higher frequency because execution time decreases 
faster than power increases when raising the frequency in 
our system. This relation can change depend on the power 
consumption of the system. 

In resume, the configuration for the best number of colors 
depends on the scale of the program and the frequency depends 
on the model of the energy consumption. 

The figures bellow 4(a), 4(b), 5(a), 5(b), 6(a) and 6(b) show 
that the model predict correctly behavior of the energy. The 
minimum predict for each input and the real consumption 
we can see on the tables II, III and IV. Notice that the 


predicted values are always higher than the real ones because 
the modeling was done with 100% cpu usage, so the predicted 
values will always be higher because the program does not use 
all cpu but the trend is correct and what we are interested to 
find the configuration with less energy consuming. 

Fig. 3. Blackscholes 
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Fig. 4. Fluidanimated 
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(b) Energy consumption for input size 2K 


Fig. 5. Raytrace 
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TABLE III 

Minimal energy fluidanimated 


Input 

Frequency 

Cores 

Estimated energy 

Consumed energy 

1 

2.1 

32 

1258 

1085 

2 

2.1 

32 

2730 

2410 

3 

2.1 

32 

4487 

4258 

4 

2.1 

32 

9461 

8061 

5 

2.1 

32 

18453 

16305 

6 

1.9 

32 

37789 

34446 

7 

2.1 

32 

76506 

67498 


TABLE IV 

Minimal energy raytrace 


Input 

Frequency 

Cores 

Estimated energy 

Consumed energy 

1 

2.2 

4 

34844 

34369 

2 

2.2 

6 

39197 

37918 

3 

2.2 

10 

43398 

39933 

4 

2.2 

14 

51243 

45769 

5 

2.2 

22 

62956 

52990 

6 

2.2 

26 

80097 

67280 
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(a) Energy consumption for input size 500 
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(b) Energy consumption for input size 4000 


TABLE II 

Minimal energy blackscholes 


VII. Comparative 

To know how good is this model we compare with the 
conservative governor on Linux, even though this governor 
changes the frequency dynamically on the program runtime 
and our change once on the program start. We compare the 
total energy consumed with our minimal configuration found, 
running with the cores 1,2,4,8,16,32 for the conservative since 
they don’t vary the active cores. 


black input in 10M.txt 



Fig. 6. Comparative Black 

The results demonstrate to be very interesting we achieved 
very close results and some even better as we can see on 6, 
7 and 8. This show that this algorithm can be enough to save 
energy. 


Input 

Frequency 

Cores 

Estimated energy 

Consumed energy 

1 

1.5 

30 

1162 

981 

2 

2.1 

32 

1997 

1690 

3 

2.2 

28 

4085 

3453 

4 

2.2 

30 

7926 

6551 

5 

2.2 

32 

14860 

13489 

6 

2.2 

28 

30392 

26515 
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Fig. 7. Comparative Fluid 
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Fig. 8. Comparative Ray Trace 


VIII. Conclusion and future work 

Using the knowledge about the runtime of the application 
and a model of the power of the computer node, that are 
information that the current scheduler algorithms does not 
has, produce good results that can make programs run more 
energetic efficient. This can be applied on HPC systems where 
the applications run exclusive on the computer just varying the 
input size. 

Also we pretend to include one more variable to this 
model the percentage of cpu utilization, with this new variable 
its possible to make our algorithm dynamically change the 
frequency find functions of frequency and number of cores 
over time that can produce even better results. 

We also pretend to estimate the data size input to develop 
a complete system that characterize a application and set the 
best configuration automatically every time that application is 
called. 
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